Customer personality analysis is a crucial tool that businesses can use to gain a deeper understanding of their target customers. By analyzing customer data, businesses can identify customer segments and their unique preferences, motivations, and behaviors. This information can then be used to modify products and services to better suit the needs of each customer segment.
For instance, instead of launching a product and marketing it to all customers in the company's database, businesses can use customer personality analysis to identify which customer segment is most likely to buy the product. They can then tailor their marketing strategies to target that specific segment, resulting in higher conversion rates and a better return on investment.
Customer personality analysis involves gathering data on factors such as demographics, psychographics, and past behavior. This data is then analyzed to identify patterns and insights that can help businesses make informed decisions about product development, marketing, and customer engagement. By understanding their customers at a deeper level, businesses can build stronger relationships, increase customer satisfaction, and ultimately drive growth.
The analysis is based on the "Customer Personality Analysis" dataset (https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount), sourced from Kaggle and contributed by Dr. Omar Romero-Hernandez. The dataset contains a variety of attributes that help segment customers. It has 2,240 rows of observations and 29 columns of variables, of which 3 are character (object) variables and 26 are numerical.
People
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
Products
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
Promotion
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
Place
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month
Our objective is to create a machine learning model that has the ability to categorize customers. To achieve this, we will be working with a dataset that has been collected from a marketing campaign. The goal of the model is to predict the response of various customer segments to a specific product or service.
Import the required tools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import Markdown
import joblib
import missingno as msno
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
palette = sns.color_palette(['gold','#cc0000', '#ace600','#33cccc'])
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler # For handling categorical column and scaling numeric columns
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
Setup plotly
This function is required for Plotly to render properly in Colab; without it, plots sometimes fail to display.
To avoid repeating this setup in every cell, we wrap it in an enable_plotly_in_cell function.
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
    init_notebook_mode(connected=False)
Read dataset from csv file
dataset = pd.read_csv("marketing_campaign.csv")
Descriptive Analysis
dataset.shape
(2240, 29)
print(f'We see the dataset has {dataset.shape[0]} observations and {dataset.shape[1]} features.')
We see the dataset has 2240 observations and 29 features.
Preview the first rows and a random sample of the data
dataset.head()
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
dataset.sample(10)
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 578 | 10242 | 1966 | PhD | Together | 47472.0 | 1 | 1 | 16-09-2013 | 39 | 56 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 852 | 4856 | 1957 | 2n Cycle | Married | 82347.0 | 0 | 0 | 06-11-2012 | 38 | 556 | ... | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 1252 | 5153 | 1967 | PhD | Married | 77766.0 | 0 | 1 | 22-02-2013 | 97 | 1004 | ... | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 651 | 6728 | 1985 | Graduation | Together | 29103.0 | 1 | 0 | 27-08-2012 | 38 | 16 | ... | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 220 | 1755 | 1988 | 2n Cycle | Together | 45204.0 | 1 | 0 | 11-04-2014 | 93 | 30 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 1146 | 749 | 1981 | Graduation | Together | 42162.0 | 1 | 0 | 19-11-2013 | 31 | 46 | ... | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 829 | 6646 | 1984 | Graduation | Together | 16529.0 | 1 | 0 | 01-03-2014 | 23 | 3 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 58 | 8557 | 1982 | Graduation | Single | NaN | 1 | 0 | 17-06-2013 | 57 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 233 | 9617 | 1951 | Graduation | Single | 15033.0 | 0 | 0 | 16-04-2014 | 37 | 0 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2065 | 123 | 1951 | Graduation | Widow | 67046.0 | 0 | 1 | 06-01-2013 | 92 | 544 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
10 rows × 29 columns
We want to look at the datatypes and check to see if they were interpreted correctly.
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Z_CostContact        2240 non-null   int64
 27  Z_Revenue            2240 non-null   int64
 28  Response             2240 non-null   int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
msno.matrix(dataset);
Observations
Examine the data with summary statistics
dataset.describe(include = 'all')
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2240.000000 | 2240.000000 | 2240 | 2240 | 2216.000000 | 2240.000000 | 2240.000000 | 2240 | 2240.000000 | 2240.000000 | ... | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.0 | 2240.0 | 2240.000000 |
| unique | NaN | NaN | 5 | 8 | NaN | NaN | NaN | 663 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Graduation | Married | NaN | NaN | NaN | 31-08-2012 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 1127 | 864 | NaN | NaN | NaN | 12 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 5592.159821 | 1968.805804 | NaN | NaN | 52247.251354 | 0.444196 | 0.506250 | NaN | 49.109375 | 303.935714 | ... | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.013393 | 0.009375 | 3.0 | 11.0 | 0.149107 |
| std | 3246.662198 | 11.984069 | NaN | NaN | 25173.076661 | 0.538398 | 0.544538 | NaN | 28.962453 | 336.597393 | ... | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.114976 | 0.096391 | 0.0 | 0.0 | 0.356274 |
| min | 0.000000 | 1893.000000 | NaN | NaN | 1730.000000 | 0.000000 | 0.000000 | NaN | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 25% | 2828.250000 | 1959.000000 | NaN | NaN | 35303.000000 | 0.000000 | 0.000000 | NaN | 24.000000 | 23.750000 | ... | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 50% | 5458.500000 | 1970.000000 | NaN | NaN | 51381.500000 | 0.000000 | 0.000000 | NaN | 49.000000 | 173.500000 | ... | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 75% | 8427.750000 | 1977.000000 | NaN | NaN | 68522.000000 | 1.000000 | 1.000000 | NaN | 74.000000 | 504.250000 | ... | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| max | 11191.000000 | 1996.000000 | NaN | NaN | 666666.000000 | 2.000000 | 2.000000 | NaN | 99.000000 | 1493.000000 | ... | 20.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 3.0 | 11.0 | 1.000000 |
11 rows × 29 columns
Look for missing values
We have already identified missing values for Income; let's check whether any other features have missing values that need to be handled.
dataset.isnull().sum()
ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                24
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64
We can confirm there are 24 missing values for the Income feature and none for any other feature. We will handle the missing Income values in an upcoming step.
Check for duplicated records
print(f'There are {dataset.duplicated().sum()} duplicated rows in the dataset.')
There are 0 duplicated rows in the dataset.
Rename columns of dataset
This is done to make it easier to process the data. At the same time, it helps with the understanding and analysis of the data
dataset.rename(columns = {'MntGoldProds':'MntGoldProducts'}, inplace = True)
Handling the missing values for Income feature
data = dataset.dropna(axis=0)
data.isnull().sum()
ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProducts        0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64
Check skewness before imputation
data['Income'].skew()
6.7634873728111184
If the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed. If the skewness is less than -1 or greater than 1, the data are highly skewed.
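A minimal sketch of why the median is the robust summary for data this skewed; the values below are synthetic stand-ins, not the actual Income column:

```python
import pandas as pd

# Synthetic income-like values: mostly moderate, plus one extreme outlier,
# mimicking the right skew observed in Income.
s = pd.Series([30_000, 40_000, 50_000, 60_000, 70_000, 600_000])

print(s.skew())    # large positive skew, driven by the outlier
print(s.mean())    # pulled upward by the outlier
print(s.median())  # robust central value, barely affected
```

The mean lands far above five of the six values, while the median stays in the middle of the bulk of the data, which is exactly why it is the safer fill value for a skewed feature.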
Because the data are highly skewed, the median (which is robust to outliers) would be the appropriate imputation value. Note that the rows with missing Income were already dropped above, so the fill below is a defensive no-op, kept to show the imputation approach.
data['Income'] = data['Income'].fillna(data['Income'].median())  # assignment form avoids chained-assignment warnings
Correcting the datatype
Converting columns to datetime format
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format='%d-%m-%Y')  # dates are stored day-first
data['Year_Birth'] = pd.to_datetime(data['Year_Birth'], format='%Y')
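Dt_Customer is stored day-first ("04-09-2012" means 4 September), while pandas' default parser reads ambiguous dates month-first, so passing an explicit format matters. A quick check of that behavior (the sample string is illustrative):

```python
import pandas as pd

raw = "04-09-2012"  # 4 September 2012 in the dataset's day-first convention

default = pd.to_datetime(raw)                      # parsed month-first: 9 April 2012
explicit = pd.to_datetime(raw, format="%d-%m-%Y")  # forced day-first: 4 September 2012

print(default.month, explicit.month)
```

Without the format, every enrolment date whose day is 12 or less is silently swapped, corrupting all tenure features derived from it.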
Feature engineering
Examine unique values for various categories
data['Education'].value_counts()
Graduation    1116
PhD            481
Master         365
2n Cycle       200
Basic           54
Name: Education, dtype: int64
data['Marital_Status'].value_counts()
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64
print("The newest customer's enrolment date in the records:", data['Dt_Customer'].max())
print("The oldest customer's enrolment date in the records:", data['Dt_Customer'].min())
Note that taking min/max of the raw date strings compares them lexicographically and gives misleading results, which is why we use the parsed datetime column. The enrolment dates span the calendar years 2012 through 2014.
Create feature from Dt_Customer that indicates the number of days a customer is registered in the firm's database. To keep it simple, I am taking this value relative to the most recent customer in the record.
data['Date_Collected'] = '01-01-2015'
data['Date_Collected'] = pd.to_datetime(data.Date_Collected)
data['Time_Enrolled_Days'] = (data['Date_Collected'] - data['Dt_Customer']).dt.days
Create feature to extract the "Age" of a customer from "Year_Birth"
data['Age'] = (data["Dt_Customer"].dt.year.max()) - (data['Year_Birth'].dt.year)
Create feature to calculate the number of years the person has been a customer
data['Years_Customer'] = (data["Dt_Customer"].dt.year.max()) - (data['Dt_Customer'].dt.year)
data['Month_Customer'] = 12.0 * (2015 - data["Dt_Customer"].dt.year ) + (1 - data["Dt_Customer"].dt.month)
data['Days_Customer'] = (data["Dt_Customer"].max()) - (data['Dt_Customer'])
Create feature Total money spent to indicate the total amount spent by the customer in various categories.
data['TotalMoneySpent'] = data['MntWines']+data['MntFruits']+data['MntMeatProducts']+data['MntFishProducts']+data['MntSweetProducts']+data['MntGoldProducts']
Create feature for Total number of purchases made to indicate the total number of purchases made by the customer in various categories.
data['TotalNumOfPurchases'] = data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases'] + data['NumDealsPurchases']
Create feature for total number of accepted campaigns to indicate the total number of campaigns accepted by the customer.
data['TotalAccCmp'] = data['AcceptedCmp1'] + data['AcceptedCmp2'] + data['AcceptedCmp3'] + data['AcceptedCmp4'] + data['AcceptedCmp5'] + data['Response']
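The engineered totals above are row-wise sums; `DataFrame.sum(axis=1)` over a column list is an equivalent, less repetitive idiom. A toy sketch (made-up values):

```python
import pandas as pd

toy = pd.DataFrame({
    "MntWines": [635, 11],
    "MntFruits": [88, 1],
    "MntMeatProducts": [546, 6],
})

# Row-wise sum over the selected columns, equivalent to chaining `+`.
spend_cols = ["MntWines", "MntFruits", "MntMeatProducts"]
toy["TotalMoneySpent"] = toy[spend_cols].sum(axis=1)

print(toy["TotalMoneySpent"].tolist())
```

Holding the column names in a list also makes it easy to extend the total if new spending categories are added later.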
Deriving living situation by marital status
data["Partner"]=data["Marital_Status"].replace({"Married":"Yes", "Together":"Yes", "Absurd":"No", "Widow":"No", "YOLO":"No", "Divorced":"No", "Single":"No","Alone":"No"})
Segmenting education levels in three groups
data["Education"]=data["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})
Create feature for Total children living in the house
data["Children"] = data["Kidhome"] + data["Teenhome"]
To get further clarity on the household, create a feature indicating family size
data['Family_Size'] = data['Partner'].replace({'No': 1, 'Yes':2}) + data['Children']
Create a feature Is Parent to indicate parenthood status
data['Is_Parent'] = np.where(data.Children > 0, 1, 0)
Adding columns for the day, month and year the customer joined
data['YearJoined'] = data['Dt_Customer'].dt.year
data['MonthJoined'] = data['Dt_Customer'].dt.strftime("%B")
data['DayJoined'] = data['Dt_Customer'].dt.day_name()
Dividing age into groups
data['AgeGroup'] = pd.cut(x = data['Age'], bins = [1, 17, 24, 44, 64, 150],
labels = ['Under 18','Young adult','Adult','Middel Aged','Senior Citizen'])
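By default `pd.cut` makes bins right-inclusive, so a 17-year-old falls in "Under 18" while an 18-year-old falls in "Young adult". A quick sketch with the same bins and labels as above (sample ages are made up; labels, including the "Middel Aged" spelling, are copied from the notebook):

```python
import pandas as pd

ages = pd.Series([17, 18, 24, 25, 44, 45, 64, 65])
groups = pd.cut(ages,
                bins=[1, 17, 24, 44, 64, 150],
                labels=['Under 18', 'Young adult', 'Adult', 'Middel Aged', 'Senior Citizen'])

# Bins are (1, 17], (17, 24], (24, 44], (44, 64], (64, 150] with right=True.
print(groups.tolist())
```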
Dropping columns which are not required
data.drop(['Z_CostContact','Z_Revenue','Year_Birth','Dt_Customer', 'Marital_Status'], axis=1, inplace=True)
data.head()
| | ID | Education | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | TotalNumOfPurchases | TotalAccCmp | Partner | Children | Family_Size | Is_Parent | YearJoined | MonthJoined | DayJoined | AgeGroup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | Graduate | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 25 | 1 | No | 0 | 1 | 0 | 2012 | April | Monday | Middel Aged |
| 1 | 2174 | Graduate | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 6 | 0 | No | 2 | 3 | 1 | 2014 | August | Sunday | Middel Aged |
| 2 | 4141 | Graduate | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 21 | 0 | Yes | 0 | 2 | 0 | 2013 | August | Wednesday | Middel Aged |
| 3 | 6182 | Graduate | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 8 | 0 | Yes | 1 | 3 | 1 | 2014 | October | Thursday | Adult |
| 4 | 5324 | Postgraduate | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 19 | 0 | Yes | 1 | 3 | 1 | 2014 | January | Sunday | Adult |
5 rows × 41 columns
data.columns
Index(['ID', 'Education', 'Income', 'Kidhome', 'Teenhome', 'Recency',
'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
'MntSweetProducts', 'MntGoldProducts', 'NumDealsPurchases',
'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5',
'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response',
'Date_Collected', 'Time_Enrolled_Days', 'Age', 'Years_Customer',
'Month_Customer', 'Days_Customer', 'TotalMoneySpent',
'TotalNumOfPurchases', 'TotalAccCmp', 'Partner', 'Children',
'Family_Size', 'Is_Parent', 'YearJoined', 'MonthJoined', 'DayJoined',
'AgeGroup'],
dtype='object')
data.shape
(2216, 41)
data['Days_Customer'] = data['Days_Customer'].dt.days.astype('int16')
data['Days_Customer']
0 971
1 125
2 472
3 65
4 321
...
2235 541
2236 61
2237 315
2238 316
2239 782
Name: Days_Customer, Length: 2216, dtype: int16
Examining Unique Values
unique_number = []
for i in data.columns:
    x = data[i].value_counts().count()
    unique_number.append(x)
pd.DataFrame(unique_number, index = data.columns, columns = ["Total Unique Values"])
| | Total Unique Values |
|---|---|
| ID | 2216 |
| Education | 3 |
| Income | 1974 |
| Kidhome | 3 |
| Teenhome | 3 |
| Recency | 100 |
| MntWines | 776 |
| MntFruits | 158 |
| MntMeatProducts | 554 |
| MntFishProducts | 182 |
| MntSweetProducts | 176 |
| MntGoldProducts | 212 |
| NumDealsPurchases | 15 |
| NumWebPurchases | 15 |
| NumCatalogPurchases | 14 |
| NumStorePurchases | 14 |
| NumWebVisitsMonth | 16 |
| AcceptedCmp3 | 2 |
| AcceptedCmp4 | 2 |
| AcceptedCmp5 | 2 |
| AcceptedCmp1 | 2 |
| AcceptedCmp2 | 2 |
| Complain | 2 |
| Response | 2 |
| Date_Collected | 1 |
| Time_Enrolled_Days | 662 |
| Age | 59 |
| Years_Customer | 3 |
| Month_Customer | 36 |
| Days_Customer | 662 |
| TotalMoneySpent | 1047 |
| TotalNumOfPurchases | 39 |
| TotalAccCmp | 6 |
| Partner | 2 |
| Children | 4 |
| Family_Size | 5 |
| Is_Parent | 2 |
| YearJoined | 3 |
| MonthJoined | 12 |
| DayJoined | 7 |
| AgeGroup | 5 |
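The loop above can be collapsed into pandas' built-in `DataFrame.nunique`, which returns the same per-column unique counts in one call; a toy sketch (made-up values):

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Graduate", "Postgraduate", "Graduate"],
    "Income": [58138.0, 46344.0, 71613.0],
})

# One unique-value count per column, equivalent to looping value_counts().count()
counts = toy.nunique().to_frame("Total Unique Values")
print(counts)
```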
data.describe(include=object).T
| | count | unique | top | freq |
|---|---|---|---|---|
| Education | 2216 | 3 | Graduate | 1116 |
| Partner | 2216 | 2 | Yes | 1430 |
| MonthJoined | 2216 | 12 | August | 210 |
| DayJoined | 2216 | 7 | Sunday | 342 |
Marital Status
enable_plotly_in_cell()
maritalstatus = data['Partner'].value_counts()
fig = px.pie(maritalstatus,
values = maritalstatus.values,
names = maritalstatus.index,
color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textposition='inside', textinfo='percent+label',
marker = dict(line = dict(color = 'white', width = 4)))
fig.show()
Observation
Spending by Marital Status
enable_plotly_in_cell()
maritalspending = data.groupby('Partner')['TotalMoneySpent'].mean().sort_values(ascending=False)
maritalspending_df = pd.DataFrame(list(maritalspending.items()), columns=['Marital Status', 'Average Spending'])
plt.figure(figsize=(13,5))
sns.barplot(data = maritalspending_df, x="Average Spending", y="Marital Status", palette='rocket');
plt.xticks( fontsize=12)
plt.yticks( fontsize=12)
plt.xlabel('Average Spending', fontsize=13, labelpad=13)
plt.ylabel('Marital Status', fontsize=13, labelpad=13);
Observation
Child Analysis
enable_plotly_in_cell()
children = data['Children'].value_counts()
fig = px.pie(children,
values = children.values,
names = children.index,
color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textposition='inside', textinfo='percent+label',
marker = dict(line = dict(color = 'white', width = 2)))
fig.show()
Observation
Average Spending by Number of Children
enable_plotly_in_cell()
childrenspending = data.groupby('Children')['TotalMoneySpent'].mean().sort_values(ascending=False)
childrenspending_df = pd.DataFrame(list(childrenspending.items()), columns=['Num of Children', 'Average Spending'])
plt.figure(figsize=(10,5))
sns.barplot(data=childrenspending_df, x="Num of Children", y="Average Spending", palette='rocket_r');
plt.xticks( fontsize=12)
plt.yticks( fontsize=12)
plt.xlabel('Num of Children', fontsize=13, labelpad=13)
plt.ylabel('Average Spending', fontsize=13, labelpad=13);
sns.pairplot(data , vars=['TotalMoneySpent','Income','Age'] , hue='Children', palette='husl');
Observation
Income distribution
plt.figure(figsize=(13,8))
sns.scatterplot(x=data[data['Income']<600000]['TotalMoneySpent'], y=data[data['Income']<600000]['Income'], color='#cc0000');
Observation
Age Distribution
enable_plotly_in_cell()
plt.figure(figsize=(10,5))
ax = sns.histplot(data = data['Age'], color='salmon')
ax.set(title = "Age Distribution of Customers");
plt.xticks( fontsize=12)
plt.yticks( fontsize=12)
plt.xlabel('Age ', fontsize=13, labelpad=13)
plt.ylabel('Counts', fontsize=13, labelpad=13);
Observation
Age analysis with spending
plt.figure(figsize=(13,8))
sns.scatterplot(x=data['TotalMoneySpent'], y=data['Age']);
Observation
There doesn't seem to be any clear relationship between the age of customers and their spending habits.
Age group analysis
enable_plotly_in_cell()
agegroup = data['AgeGroup'].value_counts()
fig = px.pie(labels = agegroup.index, values = agegroup.values, names = agegroup.index, width = 550, height = 550)
fig.update_traces(textposition = 'inside',
textinfo = 'percent + label',
hole = 0.4,
marker = dict(colors = ["red", "green", "blue", "goldenrod", "magenta"],
line = dict(color = 'white', width = 2)))
fig.update_layout(annotations = [dict(text = 'Age Groups',
x = 0.5, y = 0.5, font_size = 20, showarrow = False,
font_color = 'black')],
showlegend = False)
fig.show()
Observation
Age group analysis with spending
agegroupspending = data.groupby('AgeGroup')['TotalMoneySpent'].mean().sort_values(ascending=False)
agegroupspending_df = pd.DataFrame(list(agegroupspending.items()), columns=['Age Group', 'Average Spending'])  # use the spending means, not the raw age-group counts
plt.figure(figsize=(20,10))
sns.barplot(data = agegroupspending_df, x="Average Spending", y='Age Group', palette='rocket_r');
plt.xticks( fontsize=16)
plt.yticks( fontsize=16)
plt.xlabel('Average Spending', fontsize=20, labelpad=20)
plt.ylabel('Age Group', fontsize=20, labelpad=20);
Observations
Education analysis
data['Education'].value_counts().plot.pie(explode=[0.1,0,0], autopct='%1.1f%%', shadow=True, figsize=(8,8), colors=sns.color_palette("bright"));
enable_plotly_in_cell()
fig = px.histogram(data, x = 'Education',
opacity = 1, barmode = 'overlay',
width = 800, height = 500,
color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(title='Education Level')
fig.show()
plt.figure(figsize=(13,8))
sns.histplot(x=data['TotalMoneySpent'], hue=data['Education']);
Observation
Most Bought Products
products = data[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProducts']]
product_means = products.mean(axis=0).sort_values(ascending=False)
product_means_df = pd.DataFrame(list(product_means.items()), columns=['Product', 'Average Spending'])
plt.figure(figsize=(15,10))
plt.title('Average Spending on Products')
sns.barplot(data=product_means_df, x='Product', y='Average Spending', palette='rocket_r');
plt.xlabel('Product', fontsize=20, labelpad=20)
plt.ylabel('Average Spending', fontsize=20, labelpad=20);
Observation
The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance, so we should check whether there are outliers in the data.
plt.figure(figsize=(13,8))
sns.histplot(data.Age, kde=True, color='purple');  # distplot is deprecated in recent seaborn
plt.figure(figsize=(13,8))
sns.histplot(data.Income, kde=True, color='yellow');
plt.figure(figsize=(13,8))
sns.histplot(data.TotalMoneySpent, kde=True, color='#ff9966');
enable_plotly_in_cell()
fig = make_subplots(rows=1, cols=3)
fig.add_trace(go.Box(y=data['Age'], notched=True, name='Age', marker_color = '#6699ff',
boxmean=True, boxpoints='suspectedoutliers'), 1, 2)
fig.add_trace(go.Box(y=data['Income'], notched=True, name='Income', marker_color = '#ff0066',
boxmean=True, boxpoints='suspectedoutliers'), 1, 1)
fig.add_trace(go.Box(y=data['TotalMoneySpent'], notched=True, name='Money Spent', marker_color = 'lightseagreen',
boxmean=True, boxpoints='suspectedoutliers'), 1, 3)
fig.update_layout(title_text='Box Plots for Numerical Variables')
fig.show()
data.head(1)
| | ID | Education | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | TotalNumOfPurchases | TotalAccCmp | Partner | Children | Family_Size | Is_Parent | YearJoined | MonthJoined | DayJoined | AgeGroup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | Graduate | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 25 | 1 | No | 0 | 1 | 0 | 2012 | April | Monday | Middel Aged |
1 rows × 41 columns
numerical = ['Income', 'Recency', 'Age', 'TotalMoneySpent']
def detect_outliers(d):
    for i in d:
        Q3, Q1 = np.percentile(data[i], [75, 25])
        IQR = Q3 - Q1
        ul = Q3 + 1.5 * IQR
        ll = Q1 - 1.5 * IQR
        outliers = data[i][(data[i] > ul) | (data[i] < ll)]
        print(f'*** {i} outlier points***', '\n', outliers, '\n')
detect_outliers(numerical)
*** Income outlier points***
164     157243.0
617     162397.0
655     153924.0
687     160803.0
1300    157733.0
1653    157146.0
2132    156924.0
2233    666666.0
Name: Income, dtype: float64

*** Recency outlier points***
Series([], Name: Recency, dtype: int64)

*** Age outlier points***
192    114
239    121
339    115
Name: Age, dtype: int64

*** TotalMoneySpent outlier points***
1179    2525
1492    2524
1572    2525
Name: TotalMoneySpent, dtype: int64
Observation
There are some customers aged above 100, which is implausible, so we drop those customers from the data.
There are some customers earning more than 150,000, and one with a clearly erroneous income of 666,666; we drop that extreme value.
Delete the outlier points
data = data[(data['Age']<100)]
data = data[(data['Income']<600000)]
data.shape
(2212, 41)
Some categories may appear frequently in the dataset, whereas others appear in only a few observations.
Rare values in categorical variables tend to cause over-fitting, particularly in tree-based methods. Rare labels may be present in the training set but not the test set, causing over-fitting to the training set; or they may appear in the test set but not the training set, leaving the model with no way to evaluate them.
categorical = [var for var in data.columns if data[var].dtype=='O']
# check the number of different labels
for var in categorical:
    print(data[var].value_counts() / len(data))  # np.float was removed in NumPy 1.24; plain len() suffices
    print()
    print()
Graduate         0.504069
Postgraduate     0.382007
Undergraduate    0.113924
Name: Education, dtype: float64

Yes    0.64557
No     0.35443
Name: Partner, dtype: float64

August       0.094937
October      0.093580
December     0.090416
March        0.089964
January      0.085443
May          0.084539
April        0.083183
February     0.083183
November     0.081374
June         0.075949
September    0.073689
July         0.063743
Name: MonthJoined, dtype: float64

Sunday       0.154611
Wednesday    0.148734
Thursday     0.141953
Tuesday      0.141953
Monday       0.139693
Friday       0.137884
Saturday     0.135172
Name: DayJoined, dtype: float64
As shown above, there is no rare category in the categorical variables.
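The manual division can also be expressed with `value_counts(normalize=True)`, which returns relative frequencies directly and sidesteps the removed `np.float` alias; a toy sketch (made-up labels):

```python
import pandas as pd

partner = pd.Series(["Yes", "Yes", "No", "Yes"])

# Relative frequency of each label; the values sum to 1.
freq = partner.value_counts(normalize=True)
print(freq.to_dict())
```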
plt.figure(figsize=(30, 30))
sns.heatmap(data.corr() , annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
Observation
Choosing the features we will be reporting on by removing unnecessary features
subset = data[['Income','Kidhome','Teenhome','Age','Partner','Education']]
print('This is the data we will use for clustering:')
subset.head()
This is the data we will use for clustering:
| | Income | Kidhome | Teenhome | Age | Partner | Education |
|---|---|---|---|---|---|---|
| 0 | 58138.0 | 0 | 0 | 57 | No | Graduate |
| 1 | 46344.0 | 1 | 1 | 60 | No | Graduate |
| 2 | 71613.0 | 0 | 0 | 49 | Yes | Graduate |
| 3 | 26646.0 | 1 | 0 | 30 | Yes | Graduate |
| 4 | 58293.0 | 1 | 0 | 33 | Yes | Postgraduate |
subset.describe()
| | Income | Kidhome | Teenhome | Age |
|---|---|---|---|---|
| count | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 |
| mean | 51958.810579 | 0.441682 | 0.505877 | 45.086347 |
| std | 21527.278844 | 0.536955 | 0.544253 | 11.701599 |
| min | 1730.000000 | 0.000000 | 0.000000 | 18.000000 |
| 25% | 35233.500000 | 0.000000 | 0.000000 | 37.000000 |
| 50% | 51371.000000 | 0.000000 | 0.000000 | 44.000000 |
| 75% | 68487.000000 | 1.000000 | 1.000000 | 55.000000 |
| max | 162397.000000 | 2.000000 | 2.000000 | 74.000000 |
We are not scaling the Kidhome and Teenhome columns because their values already lie between 0 and 2.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

num_cols = ['Income', 'Age']
numeric_pipeline = make_pipeline(StandardScaler())

ord_cols = ['Education']
ordinal_pipeline = make_pipeline(OrdinalEncoder(categories=[['Undergraduate', 'Graduate', 'Postgraduate']]))

nom_cols = ['Partner']
nominal_pipeline = make_pipeline(OneHotEncoder())
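StandardScaler rescales each column to zero mean and unit variance, which is why only the wide-ranging `Income` and `Age` columns need it. A quick self-contained check (the three rows below are illustrative values, not the real data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative (Income, Age) rows, not the real dataset
income_age = np.array([[58138.0, 57], [46344.0, 60], [71613.0, 49]])

# After scaling, each column has mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(income_age)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Without this step, the much larger Income values would dominate the Euclidean distances that k-means uses.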
Transform pipelines
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer(transformers=[('num', numeric_pipeline, num_cols),
                                              ('ordinal', ordinal_pipeline, ord_cols),
                                              ('nominal', nominal_pipeline, nom_cols)])
transformer
ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('standardscaler',
                                                  StandardScaler())]),
                                 ['Income', 'Age']),
                                ('ordinal',
                                 Pipeline(steps=[('ordinalencoder',
                                                  OrdinalEncoder(categories=[['Undergraduate',
                                                                              'Graduate',
                                                                              'Postgraduate']]))]),
                                 ['Education']),
                                ('nominal',
                                 Pipeline(steps=[('onehotencoder',
                                                  OneHotEncoder())]),
                                 ['Partner'])])
Fit the transformer and transform the data
transformed = transformer.fit_transform(subset)
print('Data has been Transformed')
Data has been Transformed
We will use the elbow method to identify the optimal number of clusters
from sklearn.cluster import KMeans

options = range(2, 9)
inertias = []

for n_clusters in options:
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(transformed)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances
plt.figure(figsize=(20,10))
plt.title("No. of clusters vs. Inertia")
plt.plot(options, inertias, '-o', color = 'black')
plt.xticks( fontsize=16)
plt.yticks( fontsize=16)
plt.xlabel('No. of Clusters (K)', fontsize=20, labelpad=20)
plt.ylabel('Inertia', fontsize=20, labelpad=20);
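The elbow plot is a visual heuristic; silhouette scores offer a complementary numeric check, since they peak at the k that best separates the clusters. A minimal sketch, using a synthetic stand-in for the `transformed` feature matrix rather than the real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic stand-in for `transformed`: three well-separated blobs
transformed = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

# Higher silhouette score means tighter, better-separated clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(transformed)
    scores[k] = silhouette_score(transformed, labels)
    print(k, round(scores[k], 3))
```

On this toy data the score peaks at k = 3, matching the three planted blobs; on the real `transformed` matrix the peak would be compared against the elbow reading.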
Observation
Using k-means to form clusters
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
subset['Clusters'] = kmeans.fit_predict(transformed)  # fit once and attach the cluster labels back to the data
subset.head()
| | Income | Kidhome | Teenhome | Age | Partner | Education | Clusters |
|---|---|---|---|---|---|---|---|
| 0 | 58138.0 | 0 | 0 | 57 | No | Graduate | 3 |
| 1 | 46344.0 | 1 | 1 | 60 | No | Graduate | 3 |
| 2 | 71613.0 | 0 | 0 | 49 | Yes | Graduate | 0 |
| 3 | 26646.0 | 1 | 0 | 30 | Yes | Graduate | 2 |
| 4 | 58293.0 | 1 | 0 | 33 | Yes | Postgraduate | 1 |
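Before plotting each variable separately, a per-cluster `groupby` mean summarizes every segment in one table. A minimal sketch, using a hypothetical four-row sample mirroring the clustered `subset` frame above:

```python
import pandas as pd

# Hypothetical sample mirroring the clustered `subset` frame
subset = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
    "Kidhome": [0, 1, 0, 1],
    "Age": [57, 60, 49, 30],
    "Clusters": [3, 3, 0, 2],
})

# Mean of each numeric feature per cluster gives a one-table segment profile
profile = subset.groupby("Clusters")[["Income", "Kidhome", "Age"]].mean()
print(profile)
```

The same call on the full frame would condense the count plots that follow into a single comparison table.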
Count plot to see number of customers in each cluster
plt.figure(figsize=(12, 8))
sns.countplot(x='Clusters', data=subset, palette=palette)
plt.title('Customer Distribution Within Clusters')
plt.show()
Observations
plt.figure(figsize=(12, 8))
sns.countplot(x='Clusters', data=subset, hue='Kidhome', palette=palette)
plt.title('Customer Distribution Children (Kids)')
plt.show()
Clusters Identification
plt.figure(figsize=(12, 8))
sns.countplot(x='Clusters', data=subset, hue='Teenhome', palette=palette)
plt.title('Customer Distribution Children (Teen)')
plt.show()
plt.figure(figsize=(12, 8))
sns.countplot(x='Clusters', data=subset, hue='Partner', palette=palette)
plt.title('Customer Distribution Partner')
plt.show()
plt.figure(figsize=(12, 8))
sns.countplot(x='Clusters', data=subset, hue='Education', palette=palette)
plt.title('Customer Distribution Education')
plt.show()
Observation
Compare the number of kids per household across the clusters
plt.figure(figsize=(12, 8))
sns.barplot(x=subset["Clusters"], y=subset["Kidhome"],palette=palette)
plt.title("Kids In Household vs Clusters", size=15)
plt.show()
Observation
Compare the number of teenagers per household across the clusters
plt.figure(figsize=(12, 8))
sns.barplot(x=subset["Clusters"], y=subset["Teenhome"],palette=palette)
plt.title("Teens In Household vs Clusters", size=15)
plt.show()
Observation
Find out the income of customers within the clusters
plt.figure(figsize=(12, 8))
sns.barplot(x=subset["Clusters"], y=subset["Income"],palette=palette)
plt.title("Income vs Clusters", size=15)
plt.show()
Observations
Contributing factors to income
plt.figure(figsize=(12, 8))
sns.barplot(x='Clusters', y='Income', data=subset, hue='Education', palette=palette)
plt.title('Cluster Income Distribution with Education')
plt.show()
plt.figure(figsize=(12, 8))
sns.barplot(x='Clusters', y='Income', data=subset, hue='Kidhome', palette=palette)
plt.title('Cluster Income Distribution with Kids')
plt.show()
plt.figure(figsize=(12, 8))
sns.barplot(x='Clusters', y='Income', data=subset, hue='Teenhome', palette=palette)
plt.title('Cluster Income Distribution with Teenhome')
plt.show()
plt.figure(figsize=(12, 8))
sns.barplot(x='Clusters', y='Income', data=subset, hue='Partner', palette=palette)
plt.title('Cluster Income Distribution with Partner')
plt.show()
Observations
Customer Profiling
They have the least income
Separate features and target column
x = subset.drop('Clusters', axis=1)
y = subset['Clusters']
Create train and test data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
Adding GradientBoostingClassifier to transformer pipeline
from sklearn.ensemble import GradientBoostingClassifier

final_pipeline = make_pipeline(transformer, GradientBoostingClassifier())
Fit the data to new pipeline & model
final_pipeline.fit(x_train, y_train)
Check the accuracy of our model
final_pipeline.score(x_test, y_test)
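`score` reports plain accuracy, which can hide weak performance on small clusters; per-class precision and recall give a fuller picture. A hedged, self-contained sketch on synthetic four-class data standing in for the real train/test split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic four-class data standing in for the segmented customer features
X, y = make_classification(n_samples=300, n_classes=4, n_informative=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
# Precision, recall, and F1 per class, plus overall accuracy
print(classification_report(y_te, clf.predict(X_te)))
```

Applied to `final_pipeline` and the real split, the same report would show whether any cluster is systematically misclassified.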
import joblib

joblib.dump(final_pipeline, 'customer_segmentation_cluster.pkl')  # persist the fitted pipeline for reuse
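The saved pipeline can later be reloaded with `joblib.load` and used for prediction without refitting. The round trip below is a minimal sketch using a small stand-in model rather than the notebook's pipeline:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Small stand-in model for the notebook's customer-segmentation pipeline
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Dump to disk, reload, and confirm the reloaded model predicts identically
joblib.dump(model, 'demo_model.pkl')
loaded = joblib.load('demo_model.pkl')
print((loaded.predict(X) == model.predict(X)).all())
```

In production the reloaded pipeline would receive raw customer rows, since the `ColumnTransformer` inside it reapplies the same scaling and encoding.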